Transcriptome profiling of human thymic and peripheral blood CD4 + and CD8+ T cells, using RNA-seq
Some basic info of the chosen expression dataset
| contact_address | contact_city | contact_country | contact_department | contact_email | contact_institute |
|---|---|---|---|---|---|
| Kirkeveien 166, Laboratoriebygget | OSLO | Norway | Department of Medical Genetics | b.a.lie@medisin.uio.no | OUS |
| celltype | tissue | patient | |
|---|---|---|---|
| CD4_thymus_1 | CD4 | thymus | 1 |
| CD4_thymus_2 | CD4 | thymus | 2 |
| CD4_thymus_3 | CD4 | thymus | 3 |
| CD4_thymus_4 | CD4 | thymus | 4 |
| CD4_bloodinfant_1 | CD4 | bloodinfant | 1 |
| CD4_bloodinfant_2 | CD4 | bloodinfant | 2 |
| CD4_bloodinfant_3 | CD4 | bloodinfant | 3 |
| CD4_bloodinfant_4 | CD4 | bloodinfant | 4 |
| CD4_bloodadult | CD4 | bloodadult | NA |
| CD8_thymus_1 | CD8 | thymus | 1 |
| CD8_thymus_2 | CD8 | thymus | 2 |
| CD8_thymus_3 | CD8 | thymus | 3 |
| CD8_thymus_4 | CD8 | thymus | 4 |
| CD8_thymus_5 | CD8 | thymus | 5 |
| CD8_bloodinfant_1 | CD8 | bloodinfant | 1 |
| CD8_bloodinfant_2 | CD8 | bloodinfant | 2 |
| CD8_bloodinfant_3 | CD8 | bloodinfant | 3 |
| CD8_bloodadult | CD8 | bloodadult | NA |
##
## The downloaded binary packages are in
## /var/folders/4w/5ghh1f910_zfbnfv6h5rst9w0000gn/T//Rtmpn2JddV/downloaded_packages
This is the short glaze of the newly defined-group table after refine the samples from 17 to 15.
| celltype | tissue | patient | |
|---|---|---|---|
| CD4_thymus_1 | CD4 | thymus | 1 |
| CD4_thymus_2 | CD4 | thymus | 2 |
| CD4_thymus_3 | CD4 | thymus | 3 |
| CD4_thymus_4 | CD4 | thymus | 4 |
| CD4_bloodinfant_1 | CD4 | bloodinfant | 1 |
This is a short glaze of what we start with:
The 15 samples are kept, The cell-type category is composed of CD4 and CD8, and there are 8 CD4 samples and 7 CD8 samples in the dataset. 2 samples of “bloodadult” tissue type was removed.
| ensembl_gene_id | hgnc_symbol | CD4_thymus_1 | CD4_thymus_2 | CD4_thymus_3 | |
|---|---|---|---|---|---|
| ENSG00000000003 | ENSG00000000003 | TSPAN6 | 2.750699 | 4.395522 | 1.618822 |
| ENSG00000000419 | ENSG00000000419 | DPM1 | 37.402365 | 31.998891 | 36.502047 |
| ENSG00000000457 | ENSG00000000457 | SCYL3 | 18.826214 | 20.721749 | 23.585623 |
| ENSG00000000460 | ENSG00000000460 | C1orf112 | 34.115816 | 23.810148 | 31.870986 |
| ENSG00000000938 | ENSG00000000938 | FGR | 3.268688 | 4.664636 | 2.780002 |
In order to correct the p-values, I decided to go with the BH correction method, this method is used in class. I tried two model design in this assignment the first one is the simple model regarding the cell types. In the paper, the expriment focus on the CD4 and CD8 T cells. This would be an obvious appraoch of designing the Data Model.
The simple version linear model that only take cell type as the variable
| (Intercept) | samples$celltypeCD8 |
|---|---|
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 0 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
| 1 | 1 |
Ranked genes with Top p-value
| ensembl_gene_id | hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 9800 | ENSG00000172116 | CD8B | 280.216105 | 140.008071 | 20.23733 | 0 | 1.00e-07 | 3.425836 |
| 6043 | ENSG00000138835 | RGS3 | 41.228723 | 26.976475 | 12.48887 | 0 | 2.48e-05 | 2.632282 |
| 7408 | ENSG00000153563 | CD8A | 493.908635 | 236.540282 | 11.87527 | 0 | 3.26e-05 | 2.511982 |
| 13955 | ENSG00000237506 | RPSAP15 | -2.117976 | 1.160512 | -11.52623 | 0 | 3.65e-05 | 2.436896 |
| 14138 | ENSG00000242071 | RPL7AP6 | -6.380788 | 3.853027 | -11.26642 | 0 | 3.69e-05 | 2.377555 |
Short summary of # of genes that pass the thresholds
The complex version model that takes cell type and tissue type as its variables.
To create a different linear model with a little bit more complexity. I used two arguments, one is the previous selected cell type and the other from the MDS plot I selected the tissues.
| (Intercept) | samples$celltypeCD8 | samples$tissuethymus |
|---|---|---|
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 0 | 0 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 0 |
Ranked genes with Top p-value
This is the resulting table with ranked genes with Top p-value
| ensembl_gene_id | hgnc_symbol | logFC | AveExpr | t | P.Value | adj.P.Val | B | |
|---|---|---|---|---|---|---|---|---|
| 9800 | ENSG00000172116 | CD8B | 278.623826 | 140.008071 | 21.51238 | 0 | 1.00e-07 | 4.884591 |
| 7408 | ENSG00000153563 | CD8A | 486.339551 | 236.540282 | 15.09044 | 0 | 4.50e-06 | 4.271212 |
| 6043 | ENSG00000138835 | RGS3 | 41.451709 | 26.976475 | 13.07518 | 0 | 1.59e-05 | 3.913343 |
| 2600 | ENSG00000106028 | SSBP1 | -46.072426 | 85.107571 | -12.97505 | 0 | 1.59e-05 | 3.892086 |
| 13955 | ENSG00000237506 | RPSAP15 | -2.103834 | 1.160512 | -11.91344 | 0 | 3.79e-05 | 3.641291 |
Short summary of # of genes that pass the thresholds
The above plot shows the P value of genes between the two models. Model #1 with only cell-type is colored in Orange, Model #2 with cell type + tissue is colored in Blue.
By reading this graph, we can conclude that the Multiple Parameter Data Model works better for my dataset, and I will keep using the second model for the future study.
Calculate differential expression using the Quasi liklihood model with coef=“cell-type”
|
|
|
|
Short summary of genes that pass the threshold
Using the MeanVar Plot to explore the mean-variance relationship. Only the trended dispersion is used under the quasilikelihood pipeline.
Calculate dispersion to observe variance deviates from the mean. Biological Coefficient of Variation (BCV) and dispersion is a measure of how much variation there in the samples.
Here is a summary of the number of Up regulated and down regulated genes:
Use the quasi-liklihood models to fit the data and QLF -test to test for differential expression
The MA plot shows the difference between two conditions by graphing the average intensity by log ratio
Here is several things to be mensioned in this plot:
There are multiple outliners which has a significant value of the logCPM (average log expression), therefore I assign xlim and ylim to zoom in to the plot to be able to observe whether the distribution of the up and down regulated genes is as expected.
This is the volcano plot using EnhancedVolcano package, the
HeatMaps of top hit genes using Limma Model#2 (accounting for cell-type and tissue variability)
Generally we can observe the same cell type with same tissue type samples are definitely clustered in groups.
By reading the first heatmap we can observe that with in the Cell type groups, different tissue type shows differences in there gene expression pattern, with the second heatmap, within the same tissue, the gene of different cell type also express differently.
up_genes <- rownames(qlf_output_hits$table)[qlf_output_hits$table$logFC >= 0
& qlf_output_hits$table$FDR <= 0.05]
up_genes <- normalized_count_data[up_genes, ]
down_genes <- rownames(qlf_output_hits$table)[qlf_output_hits$table$logFC <= 0
& qlf_output_hits$table$FDR <= 0.05]
down_genes <- normalized_count_data[down_genes, ]
diff_genes <- rbind(up_genes, down_genes)
# write.table(rownames(up_genes), file = "up_genes.txt", quote = F, row.names = F, col.names = F)
# write.table(rownames(down_genes), file = "down_genes.txt", quote = F, row.names = F, col.names = F)
write.table(rownames(diff_genes), file = "diff_genes.txt", quote = F, row.names = F, col.names = F)
Which method did you choose and why?
I used g:profiler for the thresholded gene set enrichment analysis.
What annotation data did you use and why? What version of the annotation are you using?
I chose GO:BP, KEGG and REAC as my annotation data.
GO biological process: releases/2019-07-01 KEGG: KEGG FTP Release 2019-09-30 Reactome: ensembl classes: 2019-10-2
How many genesets were returned with what thresholds? The threshold is 0.05.
Upregulated genes: 1353 go bioligical process terms, 110 KEGG terms and 103 REAC terms
Downregulated genes: 657 go bioligical process terms,22 KEGG terms and 376 REAC terms
Diff genes: 538 go bioligical process terms, 31 KEGG terms and 63 REAC terms
Run the analysis using the up-regulated set of genes, and the down-regulated set of genes separately. How do these results compare to using the whole list (i.e all differentially expressed genes together vs. the up-regulated and down regulated differentially expressed genes separately)?
#####Up-Regulated Genes
#####Down-Regulated Genes